The data were generated by the Duke University MS Core. The data were delivered as quality controled excel files. Data were modified to follow typical tabular format for easy loading into R.
ms.data <- read.delim("data/preprocessed.csv", sep = ",", check.names = F)
head(ms.data[, 1:10])
## Sample Timepoint Group Sex Age Treatment Infection VL C0 C2
## 1 M625 -21 IL21 F 674 None Naive 0 25.1 4.83
## 2 M625 7 IL21 F 674 None SIV 5369 33.7 12.90
## 3 M625 70 IL21 F 674 None SIV 39039 28.4 6.78
## 4 M625 91 IL21 F 674 IL21 SIV 50530 35.0 9.68
## 5 DHAE -21 IL21 M 1043 None Naive 0 56.4 10.40
## 6 DHAE 7 IL21 M 1043 None SIV 107360 26.9 15.70
We next need to format some of the columns so that their data types are correct. Namely converting the groups to a factor and the viral loads to a numeric.
ms.data$Group <- factor(ms.data$Group, levels = c("Control", "IL15", "aIL15", "IL21"))
ms.data$VL <- as.numeric(ms.data$VL)
The final step of these initial analyses is to define the amino acids that we are interested in. For this analysis we will examine all 20 amino acids and their changes during acute and chronic SIV infection.
amino.acids <- c("Ala", "Arg", "Asn", "Asp",
"Cys", "Glu", "Gln", "Gly",
"His", "Ile", "Leu", "Lys",
"Met", "Phe", "Pro", "Ser",
"Thr", "Trp", "Tyr", "Val")
Finally, we will load all the R libraries used for this analysis set.
require(ggplot2)
require(dplyr)
require(rpart)
require(rpart.plot)
require(pheatmap)
require(caret)
The first analysis we will perform is a simple heatmap analysis to examine expression of the 20 amino acids during the three timepoints we are interested in: Pre-infection, Acute infection, and Chronic infection.
Since there is a sample that did not have Day -21 and Day 70 timepoints but did have Day 0 and Day 50, we will modify the timepoint labels to include that sample. We will first copy the raw data into a new data frame without the Day 91 timepoint since that is not of interest for this analysis.
hm.data <- ms.data[, c("Sample", "Timepoint", amino.acids)]
hm.data <- hm.data %>% filter(Timepoint != 91)
hm.data$Timepoint <- ifelse(hm.data$Timepoint==70 | hm.data$Timepoint == 50, "Chronic",
ifelse(hm.data$Timepoint == 7, "Acute", "Pre"))
# Set the Timepoint as a factor and order the levels correctly
hm.data$Timepoint <- factor(hm.data$Timepoint, levels = c("Pre", "Acute", "Chronic"))
Next, we need to modify the hm.data data frame into a matrix format for the pheatmap function.
# First let's make sure the timepoints are ordered correctly
hm.data <- with(hm.data, hm.data[order(Timepoint),])
# Next we need to transpose the data frame since pheatmap wants features in rows and samples in columns
hm.data <- as.data.frame(t(hm.data))
# Here we are using sapply to make sure all the columns are of a numeric class
mat <- as.matrix(sapply(hm.data[3:nrow(hm.data),], as.numeric))
# Set row name to the Amino Acid names
rownames(mat) <- rownames(hm.data)[3:22]
# convert our matrix to log2 for better visualization
mat <- log2(mat)
# Create annotation data frames that link the columns and rows to their correct identifiers
my_col <- as.data.frame(t(hm.data[2,,drop=F]))
# We also need to reset the factors for this
my_col$Timepoint <- factor(my_col$Timepoint, levels = c("Pre", "Acute", "Chronic"))
my_row <- as.data.frame(row.names(hm.data)[3:22])
Now that we have our data in the correct format we can simply pass these to the pheatmap function to visualize.
# Creating a custom color ramp because I don't like the default one in pheatmap
colors <- c(min(mat),seq(-2,2,by=0.01),max(mat))
my_palette <- c("green",colorRampPalette(colors = c("blue", "white", "red"))
(n = length(colors)-3), "red")
# We will cluster the rows because we are interested in clusters of amino acids potentially predicting viremia outcomes later on
# Gaps are used on both columns and rows for easier visualization of groups
pheatmap(mat,
annotation_col = my_col,
cluster_rows = T,
cluster_cols = F,
show_colnames = F,
scale = "row",
color = my_palette,
gaps_col = c(20,40),
cutree_rows = 4)
Our dataset is too small for statistically rigorous decision trees but as a proof-of-concept we will perform one anyway.
The first thing we need to do is create a new data frame of just the information we need. In order to do this we are going to take the Chronic timepoint and classify samples as either “Low” or “High” viremia based on if their viremia at chronic infection is less than or greater than the median, respectively.
# Summarize data to obtain mean, media, standard deviation, and standard error
summ.ms.data <- ms.data %>%
filter(Timepoint == 50 | Timepoint == 70) %>%
summarize(avg = mean(VL), n = n(),
std = sd(VL), se = std/sqrt(n),
median = median(VL))
# Set Viral Load Group (VLG) based on median VL
vlg.labels <- ifelse(ms.data[ms.data$Timepoint==50 | ms.data$Timepoint == 70, "VL"] > summ.ms.data$median, "High", "Low")
names(vlg.labels) <- ms.data[ms.data$Timepoint==50 | ms.data$Timepoint == 70, "Sample"]
# Need to set all timepoints to use these grouping labels
for(x in unique(ms.data$Sample)) {
ms.data[ms.data$Sample == x, "VLG"] <- vlg.labels[x]
}
ms.data <- ms.data[, c(1:8, 636, 9:635)]
for(x in colnames(ms.data)[10:636]) {
ms.data[, x] <- as.numeric(ms.data[, x])
}
Next, we will make a new data frame for these analyses and split the data into a testing and training data frame. We are interested in examining pre-infection amino acid profiles to determine if there is a specific amino acid profile that can be predictive of lower viremia.
# Take only the pre-infection timepoints since we are trying to find a predictive amino acid profile
data <- ms.data %>% filter(Timepoint == 0 | Timepoint == -21)
data <- data[, c("VLG", amino.acids)]
for(x in amino.acids) {
data[, x] <- as.numeric(log2(data[, x]))
}
# Split the data into training and testing
set.seed(20220228)
set.seed(20220301)
inTrain = createDataPartition(data$VLG, p = .6)[[1]]
train = data[ inTrain,]
test = data[-inTrain,]
table(train$VLG)
##
## High Low
## 6 6
table(test$VLG)
##
## High Low
## 4 4
Now, we can use rpart to generate a decision tree.
model <- rpart(VLG~.,
data = train,
method="class",
control=rpart.control(minsplit=2, minbucket=1, cp=0.001))
Now let’s plot the decision tree to see what, if any, are the predictive amino acid profiles.
rpart.plot(model, type = 4, extra = 101, tweak = 1, clip.right.labs = F)
Interpreting a decision tree is composed of answering a series of Yes/No questions. For the decision tree generated here we can begin at the top most node and work our ways down the nodes.
Now that we have a decision tree we can attempt to predict the classes of the testing data frame. We only have 6 samples total in this data frame so it is not robust enough for definitive conclusions but may be indicative of SIV protection.
# Predict the test data set
pred <- predict(model, test, type = "class")
table(test$VLG, pred)
## pred
## High Low
## High 3 1
## Low 1 3
# Examine the confusion matrix
confusionMatrix(table(pred, test$VLG))
## Confusion Matrix and Statistics
##
##
## pred High Low
## High 3 1
## Low 1 3
##
## Accuracy : 0.75
## 95% CI : (0.3491, 0.9681)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.1445
##
## Kappa : 0.5
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.750
## Specificity : 0.750
## Pos Pred Value : 0.750
## Neg Pred Value : 0.750
## Prevalence : 0.500
## Detection Rate : 0.375
## Detection Prevalence : 0.500
## Balanced Accuracy : 0.750
##
## 'Positive' Class : High
##
So we can see that with this small data set we achieve 75% accuracy. However, due to the small number of samples this may just be an over fitting problem and is not robust enough to predict real world outcomes.
One thing we are interested in is the kinetics of amino acid concentrations during the course of SIV infection. To visualize this we will simply take our data and melt it into a format useful for rapid plotting.
For this set of visualizations we will utilize a couple extra libraries.
require(reshape2)
require(ggrepel)
require(ggsignif)
require(rstatix)
require(ggpubr)
In the hidden code block below we define a function to plot the selected amino acid.
generate_aa_plot <- function(x, df) {
# Filter input data frame to just be selected variable
tmp <- df %>% filter(variable == x)
tmp2 <- tmp
tmp2$value <- tmp2$value
# Summarize the data, used in plotting the mean and standard deviation ribbons
tmp2 <- tmp2 %>% group_by(Timepoint) %>%
summarize(avg = mean(value), n = n(),
std = sd(value), se = std/sqrt(n)) %>%
mutate(lower.ci = avg - qt(1 - (0.05 / 2), n - 1) * se,
upper.ci= avg + qt(1 - (0.05 / 2), n - 1) * se)
# paired t test with Benjamini-Hochberg correction
stat.test <- tmp %>%
pairwise_t_test(value ~ Timepoint,
comparisons = list(c("-21", "7"),
c("7", "70"),
c("-21", "70")),
paired = T,
p.adjust.method = "BH") %>%
add_xy_position(x = "Timepoint")
stat.test$group1 <- as.numeric(stat.test$group1)
stat.test$group2 <- as.numeric(stat.test$group2)
p <-
ggplot(tmp, aes(x = Timepoint, y = value, group = 1)) +
geom_point(aes(group = Sample),alpha=0.1) +
geom_line(aes(group = Sample),alpha=0.1) +
geom_ribbon(data = tmp2,
aes(x = Timepoint,
y = avg,
ymin = avg - std,
ymax = avg + std),
fill = "steelblue2",
alpha=0.5) +
geom_line(data = tmp2,
aes(x = Timepoint, y = avg),
color = "firebrick",
size = 1,
linetype = "dashed") +
ylab(paste(x,"Concentration (\U03BCM)")) +
scale_x_continuous(breaks = c(-21, 7, 70), labels = c("Pre", "Acute", "Chronic"))
p<- p + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
panel.background = element_blank(), axis.line = element_line(colour = "black"))
p <- p + stat_pvalue_manual(data = stat.test, xmin = "group1", xmax = "group2")
print(p)
}
Let’s create a new data frame from the original data for these plots. The significance levels were calculated with the pairwise_t_test function from the rstatix package with the parameters of paired = TRUE and the Benjamini-Hochberg correction for multiple testing.
# We only want to look at the pre-cytokine treatment timepoints
ms.siv <- ms.data
# Let's filter out unneeded info so we can melt the df
ms.siv <- ms.siv[, c(1, 2, 3, 5, 6:632)]
temp.df <- ms.siv %>% filter(Timepoint != 91)
temp.df <- temp.df[, c("Sample", "Timepoint", "Group", amino.acids)]
temp.df$Timepoint <- ifelse(temp.df$Timepoint == -21 | temp.df$Timepoint == 0, -21,
ifelse(temp.df$Timepoint == 7, 7, 70))
temp.melt <- melt(temp.df, id.vars = c("Sample", "Timepoint", "Group"))
Now that we have a good format for the data we will plot every amino acid.
for(aa in amino.acids) {
generate_aa_plot(aa, temp.melt)
}
Another analysis that we want to perform is a simple linear regression to determine if there are relationships between baseline amino acid expression and peak viral loads.
To start, let’s create a new data frame for predictions and we’ll summarize the data to obtain the maximum viral loads for each animal.
# Remove this animal because it did not become infected
ms.data <- ms.data %>% filter(Sample != "N060")
# Take the peak viremia for each sample
peak.vl <- ms.data %>%
group_by(Sample) %>%
summarize(peakVL = max(VL))
# Add in the AA expressions to our peakVL for the pre-infection timepoint
ms.pred <- merge(peak.vl, ms.data[ms.data$Timepoint == -21 | ms.data$Timepoint == 0, c("Sample", "Age", amino.acids)], by = "Sample")
ms.pred <- ms.pred[, c("Sample", "peakVL", amino.acids)]
# Transform AA expression with log2
ms.pred[, amino.acids] <- log2(ms.pred[, amino.acids])
# Transform VL with log10
ms.pred$peakVL <- log10(ms.pred$peakVL)
# Have to filter out some amino acids because we do not have enough n to support full modeling
# So we will take the top 10 highest variance AA
res <- data.frame(AA = amino.acids)
for(aa in amino.acids){
res[res$AA == aa, "var"] <- var(ms.pred[, aa])
}
# We'll put the top 10 results in a new vector
res10 <- res[order(res$var, decreasing = T), "AA"][1:10]
Now that we have a pared down list of amino acids to use in regression we’ll build our model.
f <- as.formula(paste0("peakVL ~ ", paste0(res10, collapse = " + ")))
ms.model <- lm(f, data = ms.pred)
fit.res <- summary(ms.model)
sig.res <- fit.res$coefficients[,4][fit.res$coefficients[,4] < 0.05]
sig.res <- sig.res[-1]
fit.res
##
## Call:
## lm(formula = f, data = ms.pred)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.64231 -0.29511 -0.01698 0.28586 0.70730
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.3740 6.0970 4.162 0.00316 **
## Cys 0.2244 0.2752 0.816 0.43839
## Met -0.6363 0.5837 -1.090 0.30738
## Ala 0.3032 0.5125 0.592 0.57050
## Pro -0.9978 0.7923 -1.259 0.24338
## Asp 2.3467 0.7732 3.035 0.01618 *
## Glu -2.4185 0.5965 -4.054 0.00366 **
## Leu -0.3236 0.8968 -0.361 0.72756
## Trp -1.4232 0.6124 -2.324 0.04862 *
## Thr 0.6058 0.7400 0.819 0.43670
## His 0.6195 0.9484 0.653 0.53192
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.544 on 8 degrees of freedom
## Multiple R-squared: 0.8253, Adjusted R-squared: 0.607
## F-statistic: 3.78 on 10 and 8 DF, p-value: 0.03592
After running our regression we see that Asp, Glu, and Trp are significant in our linear regression model. Next, we’ll do some plots to examine these relationships.
melt.res <- melt(ms.pred, id.vars = c("Sample", "peakVL"))
melt.res <- melt.res[melt.res$variable%in%names(sig.res),]
coefs <- ms.model$coefficients
coefs <- coefs[names(coefs)%in%names(sig.res)]
dat_text <- data.frame(
label = paste0("coef = ", round(coefs, digits = 4)),
label2 = paste0("p = ", round(sig.res, digits = 4)),
variable = names(coefs)
)
ggplot(melt.res, aes(x = peakVL, y = value)) +
geom_point() +
facet_wrap(.~variable) +
geom_text(
data = dat_text,
mapping = aes(x = 5.6, y = 7.5, label = label)
) +
geom_text(
data = dat_text,
mapping = aes(x = 5.6, y = 7.3, label = label2)
)
And finally let’s annotate the resulting data frame with a grouping factor that basically states that any peak viral load greater than the median is High and anything less than median is Low.
melt.res$Group <- ifelse(melt.res$peakVL > median(melt.res$peakVL), "High", "Low")
melt.res$Group <- factor(melt.res$Group, levels = c("Low", "High"))
ggplot(melt.res, aes(x = Group, y = value)) +
geom_boxplot() +
geom_jitter() +
stat_compare_means(method = "t.test", label.y = 7.3) +
facet_wrap(~variable)
So we can see that even though we have significant relationships in the linear regression model, only Glu showed significant differences between the Low and High group.
Here we will be essentially performing the same analysis as Linear Regression except we will now be grouping the samples first based on peak viral load and performing logistic regression to determine relationships between amino acid expression and groupings.
# Drop this animal because it didn't get infected
ms.data <- ms.data %>% filter(Sample != "N060")
# Take the peak viremia for each sample
peak.vl <- ms.data %>%
group_by(Sample) %>%
summarize(peakVL = max(VL))
# Add in the AA expressions to our peakVL for the pre-infection timepoint
ms.pred <- merge(peak.vl, ms.data[ms.data$Timepoint == -21 | ms.data$Timepoint == 0, c("Sample", "Age", amino.acids)], by = "Sample")
ms.pred <- ms.pred[, c("Sample", "peakVL", amino.acids)]
ms.pred[, amino.acids] <- log2(ms.pred[, amino.acids])
# Add grouping factor for "High" and "Low" peak viral load
ms.pred$Group <- ifelse(ms.pred$peakVL > median(ms.pred$peakVL), 1, 0)
ms.pred$Group <- factor(ms.pred$Group, levels = c(0,1))
Next we will perform the same analyses as before. First we select the top 10 highest variance amino acids and perform glm().
# Have to filter out some amino acids because we do not have enough n to support full modeling
res <- data.frame(AA = amino.acids)
for(aa in amino.acids){
res[res$AA == aa, "var"] <- var(ms.pred[, aa])
}
res10 <- res[order(res$var, decreasing = T), "AA"][1:10]
f <- as.formula(paste0("Group ~ ", paste0(res10, collapse = " + ")))
ms.model <- glm(f, family = binomial, data = ms.pred)
fit.res <- summary(ms.model)
fit.res
##
## Call:
## glm(formula = f, family = binomial, data = ms.pred)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.181e-05 -1.862e-06 -2.110e-08 1.959e-06 1.329e-05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.361e+02 5.914e+06 0 1
## Cys -3.473e+01 3.623e+05 0 1
## Met -1.311e+02 4.547e+05 0 1
## Ala -1.085e+01 1.075e+06 0 1
## Pro 4.584e+01 1.826e+06 0 1
## Asp 6.001e+00 6.932e+05 0 1
## Glu -1.926e+02 5.657e+05 0 1
## Leu -2.502e+02 1.238e+06 0 1
## Trp 6.074e+00 5.930e+05 0 1
## Thr 2.497e+02 8.546e+05 0 1
## His 3.004e+02 1.111e+06 0 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2.6287e+01 on 18 degrees of freedom
## Residual deviance: 6.9096e-10 on 8 degrees of freedom
## AIC: 22
##
## Number of Fisher Scoring iterations: 25
an.res <- anova(ms.model, test = "Chisq")
sig <- row.names(an.res)[an.res$`Pr(>Chi)` < 0.05][2]
an.res
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Group
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 18 26.287
## Cys 1 0.2599 17 26.027 0.6102
## Met 1 0.0004 16 26.027 0.9832
## Ala 1 0.2027 15 25.824 0.6526
## Pro 1 1.3707 14 24.453 0.2417
## Asp 1 0.8626 13 23.591 0.3530
## Glu 1 23.5907 12 0.000 1.192e-06 ***
## Leu 1 0.0000 11 0.000 1.0000
## Trp 1 0.0000 10 0.000 1.0000
## Thr 1 0.0000 9 0.000 1.0000
## His 1 0.0000 8 0.000 1.0000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here we see that only Glu was significant in our model. We will not examine the expression between the two groups.
ggplot(ms.pred, aes(x = Group, y = ms.pred[, sig])) +
geom_boxplot(aes(fill = Group)) +
geom_jitter() +
ylab(sig) +
scale_fill_discrete(labels = c("0"="Low", "1" = "High")) +
scale_x_discrete("", c(0,1), c("Low", "High")) +
stat_compare_means(method = "t.test", label.x = 1.5, label.y =7)